We are tasked with using the used cars dataset from Kaggle, collected by Austin Reese, which contains over 400,000 used car listings scraped from Craigslist, to build a model of listing prices. The data includes useful variables such as the manufacturer and model of the car, the drive, type, transmission, colour, location, and so on.
The goal of the project is to predict the listing price by exploring visualisations and observations and, eventually, building a regression model that learns from this data via supervised learning. Such a prediction helps individuals and businesses estimate how much a car will be worth after years of use and depreciation, information that can inform decisions such as purchasing a new car, among other use cases.
import opendatasets as od
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import *
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.feature_selection import f_regression, SelectKBest
from scipy.stats import uniform, truncnorm, randint
from sklearn import metrics
from sklearn.metrics import mean_absolute_error as mae
from joblib import dump, load
from sklearn.ensemble import RandomForestRegressor
od.download("https://www.kaggle.com/austinreese/craigslist-carstrucks-data")
Skipping, found downloaded files in "./craigslist-carstrucks-data" (use force=True to force download)
Let's understand what the exact columns are and what their corresponding values look like.
df = pd.read_csv("/Users/mehervaswani/craigslist-carstrucks-data/vehicles.csv")
df.head()
| id | url | region | region_url | price | year | manufacturer | model | condition | cylinders | ... | size | type | paint_color | image_url | description | county | state | lat | long | posting_date | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7222695916 | https://prescott.craigslist.org/cto/d/prescott... | prescott | https://prescott.craigslist.org | 6000 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | az | NaN | NaN | NaN |
| 1 | 7218891961 | https://fayar.craigslist.org/ctd/d/bentonville... | fayetteville | https://fayar.craigslist.org | 11900 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | ar | NaN | NaN | NaN |
| 2 | 7221797935 | https://keys.craigslist.org/cto/d/summerland-k... | florida keys | https://keys.craigslist.org | 21000 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | fl | NaN | NaN | NaN |
| 3 | 7222270760 | https://worcester.craigslist.org/cto/d/west-br... | worcester / central MA | https://worcester.craigslist.org | 1500 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | ma | NaN | NaN | NaN |
| 4 | 7210384030 | https://greensboro.craigslist.org/cto/d/trinit... | greensboro | https://greensboro.craigslist.org | 4900 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | nc | NaN | NaN | NaN |
5 rows × 26 columns
df.info()
df.nunique(axis=0)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426880 entries, 0 to 426879
Data columns (total 26 columns):
 #   Column        Non-Null Count   Dtype
---  ------        --------------   -----
 0   id            426880 non-null  int64
 1   url           426880 non-null  object
 2   region        426880 non-null  object
 3   region_url    426880 non-null  object
 4   price         426880 non-null  int64
 5   year          425675 non-null  float64
 6   manufacturer  409234 non-null  object
 7   model         421603 non-null  object
 8   condition     252776 non-null  object
 9   cylinders     249202 non-null  object
 10  fuel          423867 non-null  object
 11  odometer      422480 non-null  float64
 12  title_status  418638 non-null  object
 13  transmission  424324 non-null  object
 14  VIN           265838 non-null  object
 15  drive         296313 non-null  object
 16  size          120519 non-null  object
 17  type          334022 non-null  object
 18  paint_color   296677 non-null  object
 19  image_url     426812 non-null  object
 20  description   426810 non-null  object
 21  county        0 non-null       float64
 22  state         426880 non-null  object
 23  lat           420331 non-null  float64
 24  long          420331 non-null  float64
 25  posting_date  426812 non-null  object
dtypes: float64(5), int64(2), object(19)
memory usage: 84.7+ MB
id              426880
url             426880
region             404
region_url         413
price            15655
year               114
manufacturer        42
model            29667
condition            6
cylinders            8
fuel                 5
odometer        104870
title_status         6
transmission         3
VIN             118264
drive                3
size                 4
type                13
paint_color         12
image_url       241899
description     360911
county               0
state               51
lat              53181
long             53772
posting_date    381536
dtype: int64
From these few functions, we observe a total of 426,880 entries. Only id, url, region, region_url, price and state contain no missing values; the remaining columns have missing or insufficient values.
We have several numerical variables, while the categorical variables ("object") include condition, cylinders, fuel type, transmission type, drive type, size of car, paint_color, car type, and state (51 values: the 50 US states plus DC).
df.describe()
| id | price | year | odometer | county | lat | long | |
|---|---|---|---|---|---|---|---|
| count | 4.268800e+05 | 4.268800e+05 | 425675.000000 | 4.224800e+05 | 0.0 | 420331.000000 | 420331.000000 |
| mean | 7.311487e+09 | 7.519903e+04 | 2011.235191 | 9.804333e+04 | NaN | 38.493940 | -94.748599 |
| std | 4.473170e+06 | 1.218228e+07 | 9.452120 | 2.138815e+05 | NaN | 5.841533 | 18.365462 |
| min | 7.207408e+09 | 0.000000e+00 | 1900.000000 | 0.000000e+00 | NaN | -84.122245 | -159.827728 |
| 25% | 7.308143e+09 | 5.900000e+03 | 2008.000000 | 3.770400e+04 | NaN | 34.601900 | -111.939847 |
| 50% | 7.312621e+09 | 1.395000e+04 | 2013.000000 | 8.554800e+04 | NaN | 39.150100 | -88.432600 |
| 75% | 7.315254e+09 | 2.648575e+04 | 2017.000000 | 1.335425e+05 | NaN | 42.398900 | -80.832039 |
| max | 7.317101e+09 | 3.736929e+09 | 2022.000000 | 1.000000e+07 | NaN | 82.390818 | 173.885502 |
From this basic statistical description, we can observe the following of the numerical variables:
- The price ranges from 0 to about 3.7 billion, which means some cars were listed for free while others carry extremely hefty prices. These are outliers that will need to be handled.
- The original purchase dates of the cars go all the way back to 1900.
- The odometer ranges from 0 to 10,000,000 miles; the upper end is also clearly an outlier.
- The county contains only 0s.
We can start cleaning and transforming the data to filter the signal from the noise. Firstly, we can logically remove variables we don't need, since the success of our model relies on selecting a good set of relevant features ("feature engineering"), which involves choosing the most useful features and/or combining features to make them more useful. We also plot some visualisations to understand linearity and the relationship between the dependent variable and the independent variables.
#> Remove variables we don't need based on common sense ('id', 'url', 'region', 'region_url', 'title_status', 'VIN','image_url', 'description', 'county', 'posting_date').
df = df.drop(columns = ['id', 'url', 'region', 'region_url', 'title_status', 'VIN','image_url', 'description', 'county', 'posting_date'])
Secondly, we clean up the outliers in price, year and odometer, because these make it difficult for the model to detect underlying patterns. Rather than discarding whole columns and losing data, however, we filter out only the problematic rows.
sns.histplot(df['price'], kde=True, stat='density')
<AxesSubplot:xlabel='price', ylabel='Density'>
fig, ax = plt.subplots(figsize=(12,4))
ax.set_title('Box Whisker Plot to Identify Outliers in Prices')
sns.boxplot(x= df['price'])
<AxesSubplot:title={'center':'Box Whisker Plot to Identify Outliers in Prices'}, xlabel='price'>
We can identify outliers from the skewed values on the left and right of the graph. Some of these extreme values make it even harder to see the remainder of the values. We remove these outliers by using the interquartile range. These extreme values are possible because the data is scraped from real-world entries where typos in the entries are likely.
Q1 = df['price'].quantile(0.25)
Q3 = df['price'].quantile(0.75)
IQR = Q3-Q1
filtered_df = (df['price'] >= Q1 - 1.5 * IQR) & (df['price'] <= Q3 + 1.5 * IQR)
old_size = df.count()['price']
df = df.loc[filtered_df]
new_size = df.count()['price']
print(old_size-new_size, '(', '{:.2f}'.format(100*(old_size-new_size)/old_size), '%',')', 'outliers removed from dataset')
8177 ( 1.92 % ) outliers removed from dataset
With almost a 2% data loss, we get a clearer distribution of the prices. We can see there is still a large number of free cars (price = 0), but since this is still a possibility in the real world, we keep it instead of removing it.
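The fence logic above can be wrapped in a small reusable helper, since we will apply the same idea to the odometer shortly. This is a sketch; the name `iqr_filter` and its `k` multiplier are our own, not part of this notebook:

```python
import pandas as pd

def iqr_filter(df, column, k=1.5):
    """Keep only rows where `column` lies within [Q1 - k*IQR, Q3 + k*IQR].

    k=1.5 is the classic Tukey fence; a larger k (e.g. 3) removes only
    the most extreme outliers.
    """
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    mask = df[column].between(q1 - k * iqr, q3 + k * iqr)
    return df.loc[mask]

# Tiny demo: 1000 falls far outside the fences of the other values.
demo = pd.DataFrame({'price': [5, 6, 7, 8, 9, 10, 1000]})
print(iqr_filter(demo, 'price')['price'].tolist())  # → [5, 6, 7, 8, 9, 10]
```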
sns.histplot(df['price'], kde=True, stat='density')
<AxesSubplot:xlabel='price', ylabel='Density'>
The odometer also has outliers. As with price, these could be typing errors or simply mistakes in the listings. Very low mileages are possible, and so are very high ones.
plt.figure(figsize=(20,10))
ax = sns.scatterplot(x=df['odometer'], y=df['price'])
ax.set_title('Price Against Odometer')
fig, ax = plt.subplots(figsize=(12,4))
ax.set_title('Box Whisker Plot to Identify Outliers in Odometer')
sns.boxplot(x= df['odometer'])
<AxesSubplot:title={'center':'Box Whisker Plot to Identify Outliers in Odometer'}, xlabel='odometer'>
Since the outliers begin after roughly 375,000 miles (0.375 on the plot's 1e6 axis scale), we can drop the values beyond that. We use the IQR method again to narrow down the range, this time with a wider fence (3 × IQR) applied to the upper tail only.
Q1 = df['odometer'].quantile(0.25)
Q3 = df['odometer'].quantile(0.75)
IQR = Q3-Q1
filtered_df = (df['odometer'] <= Q3 + 3 * IQR)
old_size = df.count()['odometer']
df = df.loc[filtered_df]
new_size = df.count()['odometer']
print(old_size-new_size, '(', '{:.2f}'.format(100*(old_size-new_size)/old_size), '%',')', 'outliers removed from dataset')
1531 ( 0.37 % ) outliers removed from dataset
plt.figure(figsize=(20,10))
ax = sns.scatterplot(x=df['odometer'], y=df['price'])
ax.set_title('Price Against Odometer After Outlier Removal')
sns.histplot(df['odometer'], kde=True, stat='density')
<AxesSubplot:xlabel='odometer', ylabel='Density'>
We can observe an inverse relationship: cars with higher mileage cost less, and vice versa, which makes sense in the real world. While many cars were listed for free or have 0 mileage, we keep these values because there may be exceptional situations in the real world where used or new cars are given away for free or for very little (who wouldn't want to live in a world like this?). Keeping them also lets us cater for a variety of cases and helps prevent overfitting our model. We narrow the dataset simply by removing the extreme values that skew it.
We also restrict the year to roughly the last 50 years (1970–2021) to minimise instability in the prices. Cars bought more than 50 years ago may no longer be representative of today's market; some may even be considered collectibles. After narrowing this range, we can observe a positive correlation between year and price.
df = df[df['year'].between(1970,2021)]
sns.histplot(df['year'], kde=True, stat='density')
<AxesSubplot:xlabel='year', ylabel='Density'>
fig, ax = plt.subplots(figsize=(45,20))
fig.suptitle('Average Price by Year')
sns.barplot(x= df['year'], y=df['price'], ci = None, palette='magma')
<AxesSubplot:xlabel='year', ylabel='price'>
Finally, we can plot a correlation matrix to identify the strength of the relationships between the numerical variables and the price.
corr = df.corr()
plt.figure(figsize=(12,10))
sns.heatmap(corr, annot=True, vmin=-1,vmax=1)
plt.show()
fig, ax = plt.subplots(figsize=(35,15))
fig.suptitle('Average Odometer Reading by Year')
sns.barplot(x= df['year'], y=df['odometer'], ci=None, palette='rocket')
<AxesSubplot:xlabel='year', ylabel='odometer'>
There seems to be a correlation between year and odometer, likely because older cars have accumulated mileage over the years while newer models have not yet done so. We could investigate whether this collinearity affects our model by calculating Variance Inflation Factors (VIF), and drop one of the two variables if it is severe; for the purposes of this project, however, we will assume the VIF values are within acceptable bounds.
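If we did want to check, VIF can be computed from first principles with numpy alone, by regressing each feature on all the others. This is a sketch on toy data; the `vif` helper is our own, not the notebook's code or statsmodels':

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of a 2-D feature matrix.

    VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing column i
    on the remaining columns (plus an intercept). Values above ~5-10
    usually signal problematic collinearity.
    """
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for i in range(k):
        y = X[:, i]
        others = np.delete(X, i, axis=1)
        A = np.column_stack([np.ones(n), others])       # add intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)    # OLS fit
        resid = y - A @ coef
        r2 = 1 - resid.var() / y.var()
        out.append(1 / (1 - r2))
    return out

# Toy data: x2 is nearly a copy of x1, so both should get a large VIF,
# while the independent x3 stays near 1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)
x3 = rng.normal(size=500)
print(vif(np.column_stack([x1, x2, x3])))
```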
Next, we can summarise and visualise the categorical variables to understand their relationship with the target variable.
fig, axes = plt.subplots(1,2,figsize=(25,10), sharey=True)
fig.suptitle('Visualise Categorical Columns')
sns.barplot(x= df['condition'], y=df['price'],ci=None, color='lightcoral', ax=axes[0])
sns.barplot(x= df['transmission'], y=df['price'], ci=None, color='lightgreen',ax=axes[1])
<AxesSubplot:xlabel='transmission', ylabel='price'>
Cars in better condition sell for more than cars in poorer condition. In terms of transmission, automatic cars compare better than manual, albeit not significantly. Both, however, pale in comparison to 'other', which could include continuously variable transmissions (CVT), semi-automatics, or dual-clutch transmissions.
fig, axes = plt.subplots(1,3,figsize=(35,10), sharey=True)
fig.suptitle('Visualise Categorical Columns')
sns.barplot(x= df['paint_color'], y=df['price'], ci=None, palette='rocket', ax=axes[0])
sns.barplot(x= df['drive'], y=df['price'], ci=None, palette='mako', ax=axes[1])
sns.barplot(x= df['fuel'], y=df['price'], ci=None, palette='viridis', ax=axes[2])
<AxesSubplot:xlabel='fuel', ylabel='price'>
There is an assumption that some colours cost more than others, and while this holds for cars that are white, black, red or orange, there is less variance in prices than expected.
4-wheel drive also performs slightly better than rear-wheel drive, while front-wheel drives are more affordable. If we compare drive type against year, we can see that RWD was the default option until 4WD gained popularity from the 90s onwards.
Diesel-fuelled cars cost more than gas and hybrid options, likely because their engines are typically more expensive and are used by larger vehicles. Electric cars cost almost as much as diesel.
fig, axes = plt.subplots(1,3,figsize=(35,10), sharey=True)
fig.suptitle('Visualise Categorical Columns')
sns.barplot(x= df['size'], y=df['price'], ci=None, palette='rocket', ax=axes[0])
sns.barplot(x= df['type'], y=df['price'], ci=None, palette='mako', ax=axes[1])
sns.barplot(x= df['cylinders'], y=df['price'], ci=None, palette='viridis', ax=axes[2])
<AxesSubplot:xlabel='cylinders', ylabel='price'>
Larger cars tend to be more expensive, and this can be explained by the type of car: pickups, trucks, SUVs, and coupes are likely to be more expensive because of their manufacturers and size. Surprisingly, large vehicles like buses, mini-vans, and wagons do not necessarily cost more, but this is likely because they are used rather than new. We could compare this with year to identify any correlation between age and price.
Cars with 6, 8 or 10 cylinders tend to be more expensive, while 4 and 5-cylinder cars are cheaper.
fig, ax = plt.subplots(figsize=(35,15), sharey=True)
fig.suptitle('Visualise Categorical Columns')
sns.barplot(x= df['manufacturer'], y=df['price'], ci=None, palette='rocket')
<AxesSubplot:xlabel='manufacturer', ylabel='price'>
We can observe a correlation between price and high-end manufacturers such as tesla, jaguar, porsche, rover, aston-martin, and audi. These manufacturers also show outliers, potentially because they produce models that cost significantly more than others. Nonetheless, a large proportion of the dataset is dominated by low- to medium-budget manufacturers.
fig, ax = plt.subplots(figsize=(25,15), sharey=True)
fig.suptitle('Visualise Categorical Columns')
sns.barplot(x= df['state'], y=df['price'], ci=None, palette='rocket')
<AxesSubplot:xlabel='state', ylabel='price'>
df.plot(kind="scatter", x="long", y="lat", alpha=0.4, s=df["price"]/1000000, label="price", figsize=(10,7),c="price", cmap=plt.get_cmap("jet"), colorbar=True, )
plt.xlim(-150,0)
plt.legend()
<matplotlib.legend.Legend at 0x32fd73b20>
df = df.drop(columns=['lat','long'])
We can further deep dive into this data to answer some business questions and gain insights into what factors significantly influence the target variable.
fig, ax = plt.subplots(figsize=(25,15), sharey=True)
fig.suptitle('How is condition influenced by mileage?')
sns.barplot(x= df['condition'], y=df['odometer'], ci=None, palette='rocket')
<AxesSubplot:xlabel='condition', ylabel='odometer'>
%matplotlib inline
fig, ax = plt.subplots(figsize=(35,10), sharey=True)
fig.suptitle('What types of cars does each manufacturer sell most?')
sns.histplot(binwidth=0.2, x=df['manufacturer'], hue=df['type'], data=df, stat="count", multiple="stack")
<AxesSubplot:xlabel='manufacturer', ylabel='Count'>
fig, ax = plt.subplots(figsize=(35,20))
Company_Kilometers_Driven = df.groupby('manufacturer').odometer.mean()
Company_Kilometers_Driven.plot(kind='bar', ax=ax)
plt.xlabel("manufacturer")
plt.ylabel("average odometer reading")
plt.title("What is the average mileage of the car before it is sold?")
plt.show()
fig, ax = plt.subplots(figsize=(20,15))
ax.set_title('Is there a preference for what kind of drive is chosen each year?')
sns.scatterplot(x='year', y='price', data=df, hue = 'drive')
<AxesSubplot:title={'center':'Is there a preference for what kind of drive is chosen each year?'}, xlabel='year', ylabel='price'>
fig, ax = plt.subplots(figsize=(20,15))
ax.set_title('What type of car sells most each year?')
sns.scatterplot(x='year', y='price', data=df, hue = 'type')
<AxesSubplot:title={'center':'What type of car sells most each year?'}, xlabel='year', ylabel='price'>
fig, ax = plt.subplots(figsize=(20,15))
ax.set_title('Has there been a change in fuel prices over the past few years?')
sns.scatterplot(x='year', y='price', data=df, hue = 'fuel')
<AxesSubplot:title={'center':'Has there been a change in fuel prices over the past few years?'}, xlabel='year', ylabel='price'>
fig, ax = plt.subplots(figsize=(20,15))
ax.set_title('Is there a preference for what kind of drive is chosen each year?')
sns.scatterplot(x='year', y='price', data=df, hue = 'transmission')
<AxesSubplot:title={'center':'Is there a preference for what kind of drive is chosen each year?'}, xlabel='year', ylabel='price'>
Our dataset still contains plenty of null values, and these are hard to fill with accurate guesses. Since, after the filtering above, none of the numerical variables have missing values, we can focus on adjusting only the categorical variables. To do this we take the following two actions:
- For columns with > 40% missing values, we remove the whole column, because too much of it is missing.
- For columns with < 40% missing values, we replace the missing entries with the category 'other' instead of discarding more data.
sns.displot(
    data=df.isna().melt(value_name="missing"),
    y="variable",
    hue="missing",
    multiple="fill",
    aspect=1.25)
<seaborn.axisgrid.FacetGrid at 0x375401330>
null_values = df.isna().sum()
def na_filter(na, threshold = 0.4):
    # keep only columns whose missing-value fraction is below the threshold
    column = []
    for i in na.keys():
        if na[i]/df.shape[0] < threshold:
            column.append(i)
    return column
df = df[na_filter(null_values)]
df.columns
Index(['price', 'year', 'manufacturer', 'model', 'fuel', 'odometer',
'transmission', 'drive', 'type', 'paint_color', 'state'],
dtype='object')
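For reference, the same threshold rule can be written as a one-line boolean selection on each column's NaN fraction. This is a sketch equivalent in spirit to na_filter, shown on a toy frame with our own illustrative names:

```python
import numpy as np
import pandas as pd

# Toy frame: column 'b' is 50% NaN, so it fails a 0.4 threshold;
# 'a' and 'c' survive.
toy = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [np.nan, np.nan, 3, 4],
    'c': [1, np.nan, 3, 4],
})
# isna().mean() gives the per-column NaN fraction; select columns below 0.4.
kept = toy.loc[:, toy.isna().mean() < 0.4]
print(list(kept.columns))  # → ['a', 'c']
```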
These are the columns that remain after removing any column with >40% missing values. We now replace the remaining missing values with 'other'.
sns.displot(
    data=df.isna().melt(value_name="missing"),
    y="variable",
    hue="missing",
    multiple="fill",
    aspect=1.25)
<seaborn.axisgrid.FacetGrid at 0x3aceb4580>
df = df.fillna('other')
Check for any final missing values.
sns.displot(
    data=df.isnull().melt(value_name="missing"),
    y="variable",
    hue="missing",
    multiple="fill",
    aspect=1.25)
<seaborn.axisgrid.FacetGrid at 0x3acc27340>
Let's take a look at what our data looks like and how many unique values each variable contains.
df.head()
| price | year | manufacturer | model | fuel | odometer | transmission | drive | type | paint_color | state | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 27 | 33590 | 2014.0 | gmc | sierra 1500 crew cab slt | gas | 57923.0 | other | other | pickup | white | al |
| 28 | 22590 | 2010.0 | chevrolet | silverado 1500 | gas | 71229.0 | other | other | pickup | blue | al |
| 29 | 39590 | 2020.0 | chevrolet | silverado 1500 crew | gas | 19160.0 | other | other | pickup | red | al |
| 30 | 30990 | 2017.0 | toyota | tundra double cab sr | gas | 41124.0 | other | other | pickup | red | al |
| 31 | 15000 | 2013.0 | ford | f-150 xlt | gas | 128000.0 | automatic | rwd | truck | black | al |
df.nunique(axis=0)
price           14309
year               52
manufacturer       43
model           26959
fuel                5
odometer       102580
transmission        3
drive               4
type               13
paint_color        13
state              51
dtype: int64
Within the categorical variables, model has 26,959 unique entries. We drop this column because such a large number of distinct values would blow up the dimensionality of our dataset once we one-hot encode the categorical variables: it alone would add 26,959 columns. Additionally, since we already have manufacturer, and a model can belong to only one manufacturer, keeping both would reintroduce a degree of multicollinearity, as there is surely a relationship between a car's make and its model.
df = df.drop(columns=['model'])
Since we can only select four predictor variables, we will use year, odometer, manufacturer, and type; the remaining columns are dropped. This combination was selected by trial and error to see which set of variables yields the best accuracy, but since accuracy is not the goal of this practical, we settle on this combination, which yields a satisfactory result as we will see later.
df = df.drop(columns=['transmission','fuel','state','paint_color','drive'])
'Condition' was the only ordinal variable, but since we removed it for having >40% missing values, the remaining variables are all nominal (order is irrelevant). For this reason, we one-hot encode the categorical variables with dummy variables rather than any other form of encoding. This prevents the model from inferring an order between the values of a variable.
catColumns = ['manufacturer','type']
for column in catColumns:
    # create one-hot dummy columns for this variable and append them
    dummies = pd.get_dummies(df[column], drop_first=True)
    df = pd.concat([df, dummies], axis=1)
df = df.drop(columns=catColumns)
While we could choose stratified sampling, we opt for a randomised split using scikit-learn's train_test_split() because we don't know enough about the demographics of our dataset; the only distinctive characteristic we have is location, and stratified sampling is better suited to classification problems. The test set represents 20% of the original data, and we ensure that both the train and test sets have an identical number of variables.
X_train, X_test, y_train, y_test= train_test_split(df.drop('price',axis=1),
df['price'],test_size=0.20,
random_state=5564)
df = X_train.copy()
df_test = X_test.copy()
df_train_labels = y_train.copy()
df_test_labels = y_test.copy()
We standardize our numerical variables within the train and test sets so that they can be weighed equally against our categorical variables. We do this only after the split: standardizing before the train-test split would introduce a data leak, because the global mean and standard deviation would indirectly leak into the test set.
scaler = StandardScaler()
for column in ['year','odometer']:
df[column] = scaler.fit_transform(df[column].values.reshape(-1,1))
We perform the standardization of the train and test sets separately to prevent any carryover of statistics from one to the other.
std_Scaler = StandardScaler()
for column in ['year','odometer']:
    df_test[column] = std_Scaler.fit_transform(df_test[column].values.reshape(-1,1))
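For comparison, the more common scikit-learn pattern is to fit the scaler on the training split only and reuse its learned statistics on the test split, so both sets share one scale without the test data ever entering the fit. This is a sketch on toy data, not what we did above:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Toy stand-ins for a numerical column such as odometer.
train = rng.normal(loc=50_000, scale=20_000, size=(100, 1))
test = rng.normal(loc=50_000, scale=20_000, size=(20, 1))

scaler = StandardScaler().fit(train)   # learn mean/std on train only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)   # reuse train statistics: no leakage

print(round(float(train_scaled.mean()), 6))  # ≈ 0 by construction
```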
Lastly, we check that both our train and test sets have an identical number of variables after the encoding and that values are comparable across the variables.
df.head()
| year | odometer | alfa-romeo | aston-martin | audi | bmw | buick | cadillac | chevrolet | chrysler | ... | coupe | hatchback | mini-van | offroad | other | pickup | sedan | truck | van | wagon | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 226291 | -0.828528 | -0.125693 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 408968 | -0.828528 | 1.447829 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 92590 | -0.545020 | 0.273422 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 17370 | -0.261513 | 2.212647 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 322921 | -0.686774 | 1.030101 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 56 columns
df_test.head()
| year | odometer | alfa-romeo | aston-martin | audi | bmw | buick | cadillac | chevrolet | chrysler | ... | coupe | hatchback | mini-van | offroad | other | pickup | sedan | truck | van | wagon | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 228759 | -0.969290 | 0.414029 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 386312 | 0.449902 | 0.142840 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7466 | 0.875660 | -0.848805 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 415324 | 0.449902 | 0.047126 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 136507 | -4.659189 | -1.324774 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
5 rows × 56 columns
With nine variables, our linear regression model performs equally well (46% accuracy) on both train and test, indicating that overfitting is not an issue there. We use this as a performance baseline for our Random Forest with its default n_estimators=100, which yields a much higher accuracy on training (96.7%) than on test (78%). Our MAE and RMSE scores are relatively high but improve with the Random Forest model. We can reduce the errors through feature engineering, applying other algorithms, and model hyperparameter tuning.
Next, we select our top four features using the fitted model's feature_importances_ attribute. While not intuitive at first, we can observe that year, odometer, type and manufacturer perform best; type and manufacturer perform well cumulatively, as whole attributes rather than as single columns, because of our one-hot encoding. Instead of type, we could also try predicting with year, odometer, manufacturer, and drive, but since accuracy is not the priority of this practical, we continue to test our models with the four variables decided initially.
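A sketch of how such a ranking can be read off a fitted forest via feature_importances_, on toy data with our own column names (the real notebook's frame is much wider after encoding):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = pd.DataFrame({
    'year': rng.uniform(1990, 2021, 500),
    'odometer': rng.uniform(0, 300_000, 500),
    'noise': rng.normal(size=500),        # an irrelevant feature
})
# Price is driven by year and odometer only, so 'noise' should rank last.
y = 500 * (X['year'] - 1990) - 0.1 * X['odometer'] + rng.normal(scale=100, size=500)

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
ranked = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(ranked.index.tolist())
```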
from sklearn.linear_model import LinearRegression
lrmodel = LinearRegression()
lrmodel.fit(df,df_train_labels)
y_pred = lrmodel.predict(df_test)
Acc = pd.DataFrame(columns=['Model','Mean Absolute Error','Root Mean Squared Error','Accuracy on Training set','Accuracy on Testing set'])
name = 'Linear Regression'
MAE = round(metrics.mean_absolute_error(df_test_labels,y_pred),2)
RMSE = np.sqrt(metrics.mean_squared_error(df_test_labels, y_pred))
ATrS = lrmodel.score(df,df_train_labels)
ATeS = lrmodel.score(df_test,df_test_labels)
row = pd.DataFrame([{'Model':name,'Mean Absolute Error': MAE,'Root Mean Squared Error': RMSE,'Accuracy on Training set':ATrS,'Accuracy on Testing set':ATeS}])
Acc = pd.concat([Acc, row], ignore_index=True)
Acc
| Model | Mean Absolute Error | Root Mean Squared Error | Accuracy on Training set | Accuracy on Testing set | |
|---|---|---|---|---|---|
| 0 | Linear Regression | 7369.94 | 10248.662828 | 0.384282 | 0.385226 |
Our linear regression model performs worse with fewer variables, which is understandable. Our MAE and RMSE, however, remain fairly consistent despite the reduced feature set, a good indicator that the number of features is not the problem.
For our random forest with four variables, we implement hyperparameter tuning with cross-validation to see whether it performs better, at least against the linear regression.
We store our candidate hyperparameter distributions in a dictionary to pass to RandomizedSearchCV. We narrow down only two hyperparameters (n_estimators and min_samples_split) because we have already fixed the maximum number of features.
For n_estimators, we draw a random integer between 4 and 200, and for min_samples_split we use a uniform distribution.
model_params = {
'n_estimators': randint(4,200),
'min_samples_split': uniform(0.01, 0.199)
}
We instantiate the RandomForestRegressor() and set up our randomised search by selecting the model, passing our parameter distributions, and choosing the number of iterations and cross-validation folds. Each iteration trains a new model on a fresh draw from our dictionary of distributions, and the number of folds determines how many times each candidate is trained on a different subset of the data. The total number of models the random search trains is therefore the number of iterations multiplied by the number of folds; the output is the best set of hyperparameters found across all of them. This is a time-consuming process, so brace yourself.
rf2 = RandomForestRegressor()
clf = RandomizedSearchCV(rf2, model_params, n_iter=20, cv=5, random_state=5564)
model = clf.fit(df,df_train_labels)
from pprint import pprint
pprint(model.best_estimator_.get_params())
{'bootstrap': True,
'ccp_alpha': 0.0,
'criterion': 'squared_error',
'max_depth': None,
'max_features': 'auto',
'max_leaf_nodes': None,
'max_samples': None,
'min_impurity_decrease': 0.0,
'min_samples_leaf': 1,
'min_samples_split': 0.015450131046387306,
'min_weight_fraction_leaf': 0.0,
'n_estimators': 54,
'n_jobs': None,
'oob_score': False,
'random_state': None,
'verbose': 0,
'warm_start': False}
Use our tuned model to predict our test set.
y_pred = model.predict(df_test)
Acc = pd.DataFrame(columns=['Model','Mean Absolute Error','Root Mean Squared Error','Accuracy on Training set','Accuracy on Testing set'])
name = 'Random Forest Regressor'
MAE = round(metrics.mean_absolute_error(df_test_labels,y_pred),2)
RMSE = np.sqrt(metrics.mean_squared_error(df_test_labels, y_pred))
ATrS = model.score(df,df_train_labels)
ATeS = model.score(df_test,df_test_labels)
row = pd.DataFrame([{'Model':name,'Mean Absolute Error': MAE,'Root Mean Squared Error': RMSE,'Accuracy on Training set':ATrS,'Accuracy on Testing set':ATeS}])
Acc = pd.concat([Acc, row], ignore_index=True)
Acc
| Model | Mean Absolute Error | Root Mean Squared Error | Accuracy on Training set | Accuracy on Testing set | |
|---|---|---|---|---|---|
| 0 | Random Forest Regressor | 6823.82 | 9771.298524 | 0.448848 | 0.441162 |
Our Random Forest performs better than our Linear Regression in terms of accuracy across both sets, and after tuning, the gap between training and test accuracy is small, so the overfitting we saw with the default forest is no longer an issue. On a positive note, our MAE and RMSE do not deviate much from the linear regression model, which suggests the model itself is not the problem; the drop relative to our nine-variable Random Forest points instead at the removed variables. We can likely reduce our error further through feature engineering, by transforming/scaling our features, or by managing the outliers differently.
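One transformation worth trying is modelling log(1 + price) instead of the raw price, which compresses the right-skewed tail we saw in the distributions; scikit-learn's TransformedTargetRegressor makes this a drop-in change. This is a sketch on synthetic data, not something run in this notebook:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
# A right-skewed target, loosely mimicking a price distribution.
y = np.exp(0.5 * X[:, 0]) * rng.lognormal(sigma=0.2, size=200)

# The regressor is fit on log1p(y); predictions are mapped back with expm1.
model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
).fit(X, y)

print(model.predict(X[:3]).shape)
```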
fig, ax = plt.subplots(figsize=(20,15))
plt.scatter(df_test_labels, y_pred)
<matplotlib.collections.PathCollection at 0x144ebf460>
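Finally, since joblib's dump and load were imported at the top but never used, here is a sketch of persisting a fitted model for later reuse, with a toy model and a temporary file path of our own choosing:

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.linear_model import LinearRegression

# Toy model on an exactly linear relationship y = 2x + 1.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2 * X[:, 0] + 1
model = LinearRegression().fit(X, y)

# Serialise the fitted estimator and reload it without refitting.
path = os.path.join(tempfile.mkdtemp(), 'car_price_model.joblib')
dump(model, path)
restored = load(path)
print(restored.predict([[5.0]]))  # → [11.]
```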